AR100

From linux-sunxi.org

The AR100, also called the CPUS, arisc, or ARISC in SoC documentation, is a coprocessor present in the A31 and newer sunxi SoCs (including the popular H3 and most 64-bit chips). It is not another ARM core, but instead uses the 32-bit OpenRISC 1000 instruction set architecture.

Allwinner releases a closed-source firmware blob for the AR100 as part of their BSP. This blob provides power management services to software running on the ARM CPUs, such as Linux and U-Boot. It also implements deep power-saving modes ("super standby") for the BSP kernel. The AR100 is not currently used for anything on mainline Linux, as power management there is implemented using native drivers. A few projects have begun to write free firmware for the AR100, using it for power management or as an independent microcontroller.

Hardware

While the name "AR100" refers only to the OpenRISC CPU core, the processor is tightly integrated with other "RTC block" hardware. In general, any device whose name begins with "R_" is intended to be controlled by the AR100. This includes the R_PIO, R_PRCM, and several timers. This also includes the R_CIR infrared receiver, so a remote control can be used to wake the SoC from deep sleep.

CPU core

The AR100 is based on the OR1200 implementation of the OpenRISC 1000 architecture. This is an open-source CPU; Verilog source code is available at https://github.com/openrisc/or1200. The AR100 CPU reports that it is hardware revision 1 in its "Version Register" SPR (below). Since the revision was changed to 8 in commit 31c7fde6, the AR100 hardware cannot be based on any OR1200 commit newer than 4a4a9675. Thus, the AR100 is a very old design with several known bugs, some of which are detailed below.

Instruction set

The OpenRISC architecture is very flexible, with many optional features. The AR100 only supports the 32-bit base instruction set ("ORBIS32"), so it has no floating-point or vector arithmetic instructions. Even within ORBIS32, several instructions are optional. Running an unimplemented instruction should cause an "Illegal Instruction" exception, but this is not always the case due to bugs. Optional instruction support is described in the following table:

Instruction Description Implemented? Functional?
l.cmov Conditional Move Yes[1] Yes[2] (but broken when icache is enabled)
l.csync Context Synchronization Yes[2] Yes (used by the Allwinner blob)
l.cust[1-8] Custom Instructions No[2] N/A
l.div Divide Signed Yes[1][2] No[2]
l.divu Divide Unsigned Yes[1][2] No[2]
l.extbs Extend Byte with Sign Yes[1][2] No[2]
l.extbz Extend Byte with Zero Yes[1][2] No[2]
l.exths Extend Half Word with Sign Yes[1][2] No[2]
l.exthz Extend Half Word with Zero Yes[1][2] No[2]
l.extws Extend Word with Sign Yes[1][2] No[2]
l.extwz Extend Word with Zero Yes[1][2] No[2]
l.ff1 Find First 1 Yes[1][2] Yes[2]
l.fl1 Find Last 1 Yes[1][2] No[2] (functions same as l.ff1)
l.lwa Load Single Word Atomic Unknown Unknown
l.mac Multiply and Accumulate Signed Yes[1] Unknown
l.maci Multiply Immediate and Accumulate Signed Yes[1] Unknown
l.macrc MAC Read and Clear Yes[1] Unknown
l.macu Multiply and Accumulate Unsigned Unknown Unknown
l.msb Multiply and Subtract Signed Unknown Unknown
l.msbu Multiply and Subtract Unsigned Unknown Unknown
l.msync Memory Synchronization Yes[2] Unknown
l.mul Multiply Signed Yes[1][2] Yes[2] (used by the Allwinner blob)
l.muld Multiply Signed to Double Unknown Unknown
l.muli Multiply Immediate Signed Yes[1][2] No[2] (returns indeterminate value)
l.mulu Multiply Unsigned Yes[1][2] No[2] (always returns the value in rB[2])
l.mulud Multiply Unsigned to Double Unknown Unknown
l.psync Pipeline Synchronization Yes[2] Unknown
l.ror Rotate Right Yes[1][2] No[2]
l.rori Rotate Right with Immediate Yes[1][2] No[2]
l.sfeqi Set Flag if Equal Immediate Yes[2] Yes[2]
l.sfgesi Set Flag if Greater Than or Equal to Immediate Signed Yes[2] Yes[2]
l.sfgeui Set Flag if Greater Than or Equal to Immediate Unsigned Yes[2] Yes[2]
l.sfgtsi Set Flag if Greater Than Immediate Signed Yes[2] Yes[2]
l.sfgtui Set Flag if Greater Than Immediate Unsigned Yes[2] Yes[2]
l.sflesi Set Flag if Less Than or Equal to Immediate Signed Yes[2] Yes[2]
l.sfleui Set Flag if Less Than or Equal to Immediate Unsigned Yes[2] Yes[2]
l.sfltsi Set Flag if Less Than Immediate Signed Yes[2] Yes[2]
l.sfltui Set Flag if Less Than Immediate Unsigned Yes[2] Yes[2]
l.sfnei Set Flag if Not Equal Immediate Yes[2] Yes[2]
l.srai Shift Right Arithmetic with Immediate Yes[2] Yes[2]
l.swa Store Single Word Atomic Unknown Unknown
l.trap Trap Yes[2] Unknown

See the toolchain section below for how this instruction support translates into GCC flags.

OR1200 features

The OR1200 itself is also very configurable. Bits in the "Unit Present Register" and other SPRs describe which features are available. In summary:

  • General purpose registers: 32
  • Instruction set(s) supported: ORBIS32
  • Delay slot: 1 present
  • Byte ordering: big-endian only (but see Memory below)
  • Instruction cache: 4KiB, one-way, physically tagged, 16-byte cache blocks
  • Data cache: not present
  • MMU: not present
  • Multiply-Accumulate (MAC) unit: present
  • Debug unit: present
  • Performance counters: not present
  • Power management: present (broken)
  • Programmable interrupt controller: present (broken)
  • Tick timer: present
  • FPU: not present

SPR data

The OpenRISC 1000 architecture defines several special-purpose registers, or SPRs. Some of the informational ones are detailed here:

SPR Name Value Interpretation
SPR_VR Version Register 0x12000001
Revision: 1
Updated Version Registers: not present
Configuration Template: 0x0
Version: 0x12
SPR_UPR Unit Present Register 0x00000765
Unit Present Register: present
Data Cache: not present
Instruction Cache: present
Data MMU: not present
Instruction MMU: not present
MAC: present
Debug Unit: present
Performance Counters Unit: not present
Power Management: present
Programmable Interrupt Controller: present
Tick Timer: present
Custom Units: none present
SPR_CPUCFGR CPU Configuration Register 0x00000020
Number of Shadow GPR Files: 0
Custom GPR File: no
ORBIS32 Supported: yes
ORBIS64 Supported: no
ORFPX32 Supported: no
ORFPX64 Supported: no
ORVDX64 Supported: no
Delay Slot: yes
Architecture Version Register: not present
Exception Vector Base Address Register: not present
Implementation-Specific Registers (ISR0-7): not present
Arithmetic Exception Control/Status Registers: not present
SPR_DCCFGR Data Cache Configuration Register 0x00002600
Number of Cache Ways: 1
Number of Cache Sets: 1
Cache Block Size: 16
Cache Write Strategy: WB
Cache Control Register Implemented: yes
Cache Block Invalidate Register Implemented: yes
Cache Block Prefetch Register Implemented: no
Cache Block Lock Register Implemented: no
Cache Block Flush Register Implemented: yes
Cache Block Write-back Register Implemented: no
SPR_ICCFGR Instruction Cache Configuration Register 0x00002640
Number of Cache Ways: 1
Number of Cache Sets: 256
Cache Block Size: 16
Cache Control Register Implemented: yes
Cache Block Invalidate Register Implemented: yes
Cache Block Prefetch Register Implemented: no
Cache Block Lock Register Implemented: no

Note: the DCCFGR values are not meaningful, since the data cache is not implemented.
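
The raw values in the table can be decoded mechanically. The following is a minimal sketch of that decoding; the bit positions are taken from the OpenRISC 1000 architecture manual rather than from this page, and the macro and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* SPR_UPR bits (positions assumed from the OpenRISC 1000 manual). */
#define UPR_UP    (1u << 0)   /* Unit Present Register itself */
#define UPR_DCP   (1u << 1)   /* data cache */
#define UPR_ICP   (1u << 2)   /* instruction cache */
#define UPR_DMP   (1u << 3)   /* data MMU */
#define UPR_IMP   (1u << 4)   /* instruction MMU */
#define UPR_MP    (1u << 5)   /* MAC unit */
#define UPR_DUP   (1u << 6)   /* debug unit */
#define UPR_PCUP  (1u << 7)   /* performance counters unit */
#define UPR_PMP   (1u << 8)   /* power management */
#define UPR_PICP  (1u << 9)   /* programmable interrupt controller */
#define UPR_TTP   (1u << 10)  /* tick timer */

/* True if the given unit bit is set in a SPR_UPR value. */
bool upr_has(uint32_t upr, uint32_t unit)
{
    return (upr & unit) != 0;
}

/* SPR_VR: version in bits 31:24, revision in bits 5:0. */
uint32_t vr_version(uint32_t vr)  { return vr >> 24; }
uint32_t vr_revision(uint32_t vr) { return vr & 0x3fu; }

/*
 * SPR_ICCFGR: bits 2:0 hold log2(ways), bits 6:3 hold log2(sets),
 * bit 7 selects the block size (0 = 16 bytes, 1 = 32 bytes).
 * Returns the total cache size in bytes.
 */
uint32_t iccfgr_cache_bytes(uint32_t iccfgr)
{
    uint32_t ways  = 1u << (iccfgr & 0x7u);
    uint32_t sets  = 1u << ((iccfgr >> 3) & 0xfu);
    uint32_t block = (iccfgr & (1u << 7)) ? 32u : 16u;

    return ways * sets * block;
}
```

With the ICCFGR value above, this gives 1 way x 256 sets x 16 bytes = 4 KiB, which agrees with the instruction cache size listed under OR1200 features.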

Memory

Byte swapping/endianness

While the CPU itself is big-endian, the data bus coming out of the CPU is byte swapped. This makes 32-bit memory access appear to be "little-endian", as each group of 4 memory bytes is reversed. This is extremely convenient, as it allows the bits in the MMIO register definitions from the SoC manual to be used as is. However, 8-bit or 16-bit memory reads/writes will access the wrong data (see the table below), so transferring strings or small integers between the ARM world and the AR100 requires swapping bytes. For this reason, MMIO access from the AR100 must always use 32-bit loads and stores.

Data type C representation In memory (hex) ARM CPU interpretation AR100 interpretation
32-bit integer 0x12345678 78 56 34 12 0x12345678 0x12345678
16-bit integers 0xabcd, 0x1234 cd ab 34 12 0xabcd, 0x1234 0x1234, 0xabcd
Characters 'R', 'I', 'S', 'C' 52 49 53 43 'R', 'I', 'S', 'C' 'C', 'S', 'I', 'R'
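
One way to reason about the table is to model the swap as an address transformation: a byte access from the AR100 at address a lands on memory byte a XOR 3 within the same word. The sketch below is only a host-side model that reproduces the rows of the table, not a description of the actual bus logic; the function names are hypothetical.

```c
#include <stdint.h>

/* Model: an AR100 byte access at `addr` reaches memory byte `addr ^ 3`. */
uint8_t ar100_read8(const uint8_t *mem, uint32_t addr)
{
    return mem[addr ^ 3u];
}

/* Big-endian halfword as seen by the AR100: high byte at the lower address. */
uint16_t ar100_read16(const uint8_t *mem, uint32_t addr)
{
    return (uint16_t)((ar100_read8(mem, addr) << 8) |
                      ar100_read8(mem, addr + 1));
}

/* Big-endian word as seen by the AR100, built from two halfwords. */
uint32_t ar100_read32(const uint8_t *mem, uint32_t addr)
{
    return ((uint32_t)ar100_read16(mem, addr) << 16) |
           ar100_read16(mem, addr + 2);
}
```

Here mem holds the bytes in the order shown in the "In memory" column. In this model, 32-bit reads return the same value the ARM cores see, while 8-bit and 16-bit reads return the elements of each word in reverse order.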

Byte swapping also affects the AR100's instruction stream. The toolchain writes instructions in big-endian byte order, and the AR100 CPU expects to read them in big-endian byte order. However, due to the byte swapping, if the instructions are stored in SRAM as-is, they will be read by the CPU as little-endian, and they will not run. To solve this, the instructions must be reversed before writing them to SRAM; they will be un-reversed when read by the AR100. This can be done using objcopy when creating a binary firmware image:

${CROSS_COMPILE}objcopy -O binary --reverse-bytes 4 firmware.elf firmware.bin
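
If the image has to be swapped at load time instead (for example by a loader running on the ARM side), the same transformation is easy to do by hand. This is a minimal sketch with a hypothetical function name, assuming the image is already in a writable buffer:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Reverse the bytes within each 4-byte group of `buf`, in place.
 * Returns 0 on success, or -1 if `len` is not a multiple of 4
 * (in which case the buffer is left untouched).
 */
int reverse_bytes_4(uint8_t *buf, size_t len)
{
    if (len % 4 != 0)
        return -1;

    for (size_t i = 0; i < len; i += 4) {
        uint8_t b0 = buf[i];
        uint8_t b1 = buf[i + 1];

        buf[i]     = buf[i + 3];
        buf[i + 1] = buf[i + 2];
        buf[i + 2] = b1;
        buf[i + 3] = b0;
    }

    return 0;
}
```

Applying the function twice restores the original contents, so the same routine can be used to read back a dump of the AR100's code.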

Byte-invariant Ranges

On at least the A64 and H6, four ranges of addresses can be configured as byte-invariant. This means that 8-bit and 16-bit memory accesses will see the same values as the ARM CPUs, which is incredibly useful for sending strings between the processors. The granularity of these ranges is 1 KiB, as the low 10 bits of each start/end register are ignored. The addresses here correspond to the AR100's address space, with SRAM A2 starting at address 0. Note that using these ranges does not remove the need to byte swap the AR100's instruction stream.

Register Use
R_CPUCFG + 0x0c Byte-invariant range enable flags (one bit per range)
R_CPUCFG + 0x10 Byte-invariant range 0 start address
R_CPUCFG + 0x14 Byte-invariant range 0 end address
R_CPUCFG + 0x18 Byte-invariant range 1 start address
R_CPUCFG + 0x1c Byte-invariant range 1 end address
R_CPUCFG + 0x20 Byte-invariant range 2 start address
R_CPUCFG + 0x24 Byte-invariant range 2 end address
R_CPUCFG + 0x28 Byte-invariant range 3 start address
R_CPUCFG + 0x2c Byte-invariant range 3 end address
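
The register layout above is regular, so the offsets can be computed instead of listed. The helpers below are a sketch with hypothetical names; they only cover what the table states. Which enable bit belongs to which range, and whether the end address is inclusive, are not documented here, so the sketch deliberately leaves both out.

```c
#include <stdint.h>

#define BYTE_INV_ENABLE_OFFSET  0x0cu   /* enable flags, one bit per range */
#define BYTE_INV_RANGE_COUNT    4u
#define BYTE_INV_IGNORED_BITS   0x3ffu  /* low 10 bits are ignored (1 KiB) */

/* Offset from R_CPUCFG of the start address register for range 0..3. */
uint32_t byte_inv_start_offset(unsigned int range)
{
    return 0x10u + 8u * range;
}

/* Offset from R_CPUCFG of the end address register for range 0..3. */
uint32_t byte_inv_end_offset(unsigned int range)
{
    return 0x14u + 8u * range;
}

/* Address the hardware effectively uses, given the 1 KiB granularity. */
uint32_t byte_inv_effective_addr(uint32_t addr)
{
    return addr & ~BYTE_INV_IGNORED_BITS;
}
```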

A31

AR100 address space ARM address space Size Description Notes
Start End Start End
0x00000000 0x00001fff 0x00040000 0x00041fff 8 KiB Exception vectors Only one writable word at each 0x100 boundary
0x00002000 0x00003fff 0x00042000 0x00043fff 8 KiB Reserved
0x00004000 0x00013fff 0x00044000 0x00053fff 64 KiB SRAM A2 One read costs exactly 3 cycles
0x00020000 0x0002ffff 0x00020000 0x0002ffff 64 KiB SRAM B One read costs ~13 cycles for 300 MHz AR100 and 200 MHz AHB1
0x00040000 0x00047fff 0x00000000 0x00007fff 32 KiB SRAM A1 One read costs ~13 cycles for 300 MHz AR100 and 200 MHz AHB1
0x40000000 0xbfffffff 0x40000000 0xbfffffff 2 GiB DRAM One read costs ~53 cycles for 300 MHz AR100 and 360 MHz DRAM

H3

AR100 address space ARM address space Size Description Notes
Start End Start End
0x00000000 0x00001fff 0x00040000 0x00041fff 8 KiB Exception vectors Only one writable word at each 0x100 boundary
0x00002000 0x00003fff 0x00042000 0x00043fff 8 KiB Reserved
0x00004000 0x0000bfff 0x00044000 0x0004bfff 32 KiB SRAM A2 One read costs exactly 3 cycles
0x00040000 0x0004ffff 0x00000000 0x0000ffff 64 KiB SRAM A1 One read costs ~25 cycles for 300 MHz AR100 and 200 MHz AHB1
0x40000000 0xbfffffff 0x40000000 0xbfffffff 2 GiB DRAM One read costs ~60 cycles for 300 MHz AR100 and 672 MHz DRAM

To be investigated: something seems to be weird about the SRAM A1 and DRAM access times in H3 when compared to A31. Maybe the MBUS clock speed makes some difference too?

A64/H5

AR100 address space ARM address space Size Description Notes
Start End Start End
0x00000000 0x00001fff 0x00040000 0x00041fff 8 KiB Exception vectors Only one writable word at each 0x100 boundary
0x00002000 0x00003fff 0x00042000 0x00043fff 8 KiB Reserved
0x00004000 0x00013fff 0x00044000 0x00053fff 64 KiB SRAM A2 One read costs exactly 3 cycles
0x00040000 0x0004ffff 0x00000000 0x0000ffff 64 KiB ARM BROM One read costs ~10 cycles for 300 MHz AR100
0x00050000 0x00053fff 0x00050000 0x00053fff 16 KiB SRAM A2 (again) One read costs ~40 cycles for 300 MHz AR100
0x40000000 0xffffffff 0x40000000 0xffffffff 3 GiB DRAM One read costs ~70 cycles for 300 MHz AR100 and 314 MHz DRAM

The ARM BROM is mapped into the AR100 address space because it was moved into the previous location of SRAM A1, but the ARM/AR100 remapping was not updated. (SRAM A1 was moved to make space for the BROM; the BROM was moved to allow for more than 2 GiB of DRAM address space).

The copy of the end of SRAM A2 mapped at the AR100's 0x00050000 is interesting because its access is so much slower than expected, as if the data is going in a loop through several buses in the SoC back to AHB0.
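
When passing pointers between the two processors, it helps to have the mapping in code. The following sketch encodes the A64/H5 table above as a lookup; the structure and function names are hypothetical, and the other SoCs would need their own tables.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ar100_region {
    uint32_t ar100_start;  /* first address in the AR100 address space */
    uint32_t ar100_end;    /* last address (inclusive) */
    uint32_t arm_start;    /* corresponding first ARM address */
};

/* Rows copied from the A64/H5 table. */
static const struct ar100_region a64_regions[] = {
    { 0x00000000u, 0x00001fffu, 0x00040000u },  /* exception vectors */
    { 0x00002000u, 0x00003fffu, 0x00042000u },  /* reserved */
    { 0x00004000u, 0x00013fffu, 0x00044000u },  /* SRAM A2 */
    { 0x00040000u, 0x0004ffffu, 0x00000000u },  /* ARM BROM */
    { 0x00050000u, 0x00053fffu, 0x00050000u },  /* SRAM A2 (again) */
    { 0x40000000u, 0xffffffffu, 0x40000000u },  /* DRAM */
};

/*
 * Translate an AR100 address to the matching ARM address.
 * Returns false (leaving *arm_addr untouched) if the address is not
 * covered by any row of the table.
 */
bool a64_ar100_to_arm(uint32_t ar100_addr, uint32_t *arm_addr)
{
    size_t count = sizeof(a64_regions) / sizeof(a64_regions[0]);

    for (size_t i = 0; i < count; i++) {
        const struct ar100_region *r = &a64_regions[i];

        if (ar100_addr >= r->ar100_start && ar100_addr <= r->ar100_end) {
            *arm_addr = r->arm_start + (ar100_addr - r->ar100_start);
            return true;
        }
    }

    return false;
}
```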

Clocking

CPUS_CLK_CFG_REG

The CPU clock can be configured with a register referenced as CCMU_CPUS_CFG in the Allwinner sun6i Linux source code. It is documented in the A80 and A83T manuals under R_PRCM as CPUS_CLK_CFG_REG and CPUS_CLK_REG, respectively.

The register is generally the first register in the PRCM, located at 0x01f01400 on H3/A64/H5.

Offset: 0x0000

Name Bits R/W Default Values Description
/ 31:18 / /
CPUS_CLK_SRC_SEL 17:16 RW 01 00: LOSC (32 kHz), 01: HOSC (24 MHz), 10: PLL_PERIPH0/CPUS_POST_DIV, 11: IOSC (16 MHz) Clock source
/ 15:13 / /
CPUS_POST_DIV 12:8 RW 00000 00000: /1, 00001: /2, 00010: /3, …, 11111: /32 Post-divider for PLL_PERIPH0 source
/ 7:6 / /
CPUS_CLK_RATIO 5:4 RW 00 00: /1, 01: /2, 10: /4, 11: /8 Clock divide ratio
/ 3:0 / /
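
As a worked example of the bit fields listed above, the sketch below computes the resulting AR100 clock rate from a register value. It assumes that CPUS_CLK_RATIO divides whichever source is selected and that CPUS_POST_DIV only applies to the PLL_PERIPH0 source; the PLL_PERIPH0 rate is board-specific, so it is passed in by the caller. The function name and the exact LOSC rate are assumptions.

```c
#include <stdint.h>

#define LOSC_HZ  32768u     /* "32 kHz" oscillator; exact rate assumed */
#define HOSC_HZ  24000000u
#define IOSC_HZ  16000000u  /* nominal rate */

/*
 * Compute the AR100 clock rate in Hz from a CPUS_CLK_CFG_REG value.
 * `pll_periph0_hz` is only used when the PLL source is selected.
 */
uint32_t cpus_clk_hz(uint32_t reg, uint32_t pll_periph0_hz)
{
    uint32_t src      = (reg >> 16) & 0x3u;         /* CPUS_CLK_SRC_SEL */
    uint32_t post_div = ((reg >> 8) & 0x1fu) + 1u;  /* CPUS_POST_DIV: /1../32 */
    uint32_t ratio    = 1u << ((reg >> 4) & 0x3u);  /* CPUS_CLK_RATIO: /1../8 */
    uint32_t parent;

    switch (src) {
    case 0:
        parent = LOSC_HZ;
        break;
    case 1:
        parent = HOSC_HZ;
        break;
    case 2:
        parent = pll_periph0_hz / post_div;
        break;
    default:
        parent = IOSC_HZ;
        break;
    }

    return parent / ratio;
}
```

For instance, the default value 0x00010000 selects HOSC with no division, giving 24 MHz. With a hypothetical 600 MHz PLL_PERIPH0, the value 0x00020100 (PLL source, post-divider /2) would give 300 MHz.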

Known issues

Since the AR100 is based on an extremely old OR1200 commit, any bugs fixed in the CPU core since then can be considered "known issues". These include:

  • Multiply-accumulate unit (MAC) bugs, fixed by e.g. d24b2173 and 57a449d2.
  • l.fl1 returns the same value as l.ff1 (fixed by 66efe9cd).
  • More multiply/divide bugs, fixed by e.g. bc9b53bc.
  • Arithmetic carry/overflow flags are not implemented (done in 2c0765d7).
  • The "infamous l.rfe fix", in f0255fab.
  • l.lws does not do anything (fixed by 385ffbf3).
  • A bug with filling the instruction cache (fixed by bd5b48dc).
  • l.ror appears to be implemented, even though it is not (fixed by 26febe37).
  • Plus other bugfix commits not explained in the commit message (they point to a now-defunct bugzilla instance).

Other issues found while developing for the AR100 include:

  • The division instructions claim to be implemented but do not work at all. They return either 2 or 10 for all inputs.
  • l.cmov has some undefined effect when present 4 bytes into an instruction cache block (so at address 0x???4) with the instruction cache enabled. It appears to affect later instructions in the pipeline, as if the next few instructions are skipped. Workaround is to not use l.cmov (it's not generated by gcc by default anyway).
  • l.ror does not work, most likely because it is unimplemented. Due to a known bug in the OR1200 (see above), it does not cause an "Illegal Instruction" exception even when unimplemented.
  • All bits in the power management register appear to be ignored. Probably this means the signals from the OR1200 core are not connected to any control logic in the SoC. This makes it impossible to stop or slow down the AR100 CPU when it is idle. Workaround is to control the clock using the register in the PRCM.
  • The programmable interrupt controller (PIC) registers claim to be implemented, but have no effect. No workaround is needed, as all interrupts come in through an external interrupt controller (R_INTC, compatible with the interrupt controller in the A13).

These issues may be due to a (later fixed or still unfixed) bug in the OR1200, modifications made by Allwinner to the OR1200 core, or a silicon bug in the SoC.

Software

Toolchain

Mainline GCC

OpenRISC is an officially-supported architecture as of GCC 9.

This new GCC port requires binutils 2.32 or newer, as it generates assembly with relocation syntax only understood by the newer gas.

Because the or1k instruction set has optional instructions, GCC has flags to tell it which ones your processor supports. That way it will avoid generating code that contains unimplemented instructions. The flags used by the mainline GCC port are different from the ones used by or1k-gcc. The authoritative reference is the GCC source (this file is easy to read). For the AR100, you should use:

-msfimm -mshftimm -msoft-div -msoft-mul

You can use -mhard-mul if you are careful and only do signed multiplication, and make sure the compiler doesn't use l.muli either. However, it's safer to use -msoft-mul and provide a __mulsi3 implementation that uses the l.mul instruction only.
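
A __mulsi3 along those lines could look like the sketch below. It assumes the compiler predefines __or1k__ when targeting OpenRISC and that GCC-style inline assembly is available; the host fallback exists only so the file can be built and unit-tested off-target.

```c
/*
 * 32-bit multiply helper that GCC calls when built with -msoft-mul.
 * The low 32 bits of a product are the same for signed and unsigned
 * operands, so this one routine serves both.
 */
#if defined(__or1k__)

int __mulsi3(int a, int b)
{
    int result;

    /* Use only l.mul; l.muli and l.mulu do not work on the AR100. */
    __asm__ ("l.mul %0, %1, %2" : "=r" (result) : "r" (a), "r" (b));

    return result;
}

#else

/* Host fallback (not for the AR100): multiply as unsigned to avoid
 * signed-overflow undefined behaviour. */
int __mulsi3(int a, int b)
{
    return (int)((unsigned int)a * (unsigned int)b);
}

#endif
```

Writing the multiplication in assembly matters here: a plain a * b in C would be compiled back into a call to __mulsi3 under -msoft-mul.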

Legacy or1k-gcc

The previous out-of-tree GCC port was known as or1k-gcc. It has not been updated since an experimental GCC 6 release, with the latest stable version being GCC 5.4.0. It is available at https://github.com/openrisc/or1k-gcc; however, that version will never be contributed upstream. The reason is explained in the #openrisc irc log from 2016-11-25:

olofk  Everyone, except for one guy has given permission for copyright assignment.
       Unfortunately, his work is very early in the development, so technically the rest is based upin that
wbx    and this guy is no longer interested in or1k?
olofk  The latest idea we had was to see if the stuff he wrote has actually been replaced by other patches
olofk  He is actually running very much involved, as he works for a company that makes proprietary versions of OpenRISC
wbx    isn't is possible to convince the guy to just offer his code as public domain, so no special fsf agreement required.
olofk  No. His standpoint is that he doesn't want to give up his ownership of the code
olofk  Which of course is just pure fucking bogus
wbx    so he thinks his company benefits from these actions so that toolchain support isn't upstream or what?
wbx    i don't understand such actions from people working with open source.
olofk  Well, me neither. But there is not much more we can do to convince him :/

The forked GCC is still GPL-licensed free software, so using it is perfectly legal (the linux-sunxi wiki provides installation instructions). Packaging a usable OpenRISC toolchain in Linux distributions (such as Debian) is another story, however, because this may involve some political arm wrestling.

Downloads of or1k-gcc binaries are available here and here. The musl toolchain is the smallest, and also works for bare metal development.

On the other hand, binutils, gdb, and newlib all have upstream OpenRISC 1000 support. It is possible to use the latest binutils release with both or1k-gcc and or32-gcc (though some symlinking is necessary in this second case because the architecture names are different).

Original (Obsolete) or32 GCC

The first port, during the GCC 3.x era, used the or32 architecture name. It is still available at an archive of the meansoffreedom.com website (file listing). This toolchain, based on GCC 3.4.4, is the one used by Allwinner to compile their firmware blob. Because it is able to elide function prologues/epilogues, it actually generates smaller code than the newest or1k-gcc release.

Allwinner blob

Allwinner's blob provides a power management API using the message box and spinlock devices, as well as a shared memory area in DRAM. The API is used by Linux (drivers/arisc), U-Boot, and ATF. All three clients have code for loading and starting the firmware blob. An example of the header containing the API definitions is available in the ATF source at plat/sun50iw2p1/include/arisc.h.

Blobs can be found in some BSP Linux kernel source trees, e.g. https://github.com/tinalinux/linux-3.10/tree/r18-v0.9/drivers/arisc/binary. The arisc_*.code files contain both the blob and an encrypted source archive. The blobs can also be found in the BSP as tools/pack/chips/*/bin/scp.bin.

Blob versions

These versions have been extracted from the blobs in smaeul's sunxi-blobs repository. Hopefully these can give some insight into the evolution of Allwinner's blob over time and between SoCs.

Version Blob directory
sun8iw5_v0.03.00-227-gcec3a2b a64/arisc_pine64_new
sun8iw5_v0.03.00-244-gb750b8e h5/arisc_lichee_a64_v3
sun8iw5_v0.03.00-399-g9fac845 sun8iw12p1/arisc_lichee_h6_4.9_beta
v0.1.54 a64/arisc_tinalinux
v0.1.76 a64/arisc_pine64
v0.1.94 a64/arisc_lichee_v2
v0.2.17.1 a64/arisc_lichee_v3
v0.2.23 h5/arisc_lichee_h6_v1rc4
v0.2.30 h6/arisc_lichee_h6_v1rc4
v0.2.83 sun8iw12p1/arisc_lichee_h6_v1rc4
v0.3.27 h3/arisc_lichee_4.4

Decompiling the H3 blob

To ease reverse engineering of the firmware for H3, you can use a script that takes the arisc_sun8iw7p1.bin file (available in the lichee H3 SDK from Draco) and produces readable pseudocode. Pseudocode is split into cross-referenced functions and basic blocks; code within basic blocks is emulated; and register assignments use evaluated values if they are known. Memory and register addresses are renamed based on the map of known locations. Most of the functions are named based on their purpose.

The resulting pseudocode can be used to understand the suspend/resume function in H3 in particular, and to write a mainline implementation.

It is available on github: megous/h3-ar100-firmware-decompiler

Reverse-engineering tools for all blob versions

Another project, sunxi-blobs, provides more generic tools for disassembling and analyzing AR100 firmware blobs (as well as boot ROMs). These scripts can take a firmware dump and generate an annotated disassembly listing as well as an SVG control flow graph (using graphviz). See the project's README for more details.

Community software projects

Information-gathering programs

Power management firmware

Mainline Kernel Support

Resource Sharing

The AR100 and the ARM cores run asynchronously and can access the same memory and registers. To prevent the AR100 from corrupting the state of the ARM cores, one of two approaches must be followed:

  1. IF the AR100 ONLY accesses chip resources that are otherwise unused from Linux, it can do so safely at any time.
  2. IF the AR100 needs to access shared resources (such as GPIO) then it needs to synchronise with the kernel, and the kernel needs to be aware of it.

To synchronise with the kernel, the chips implement hardware spinlocks. There is a kernel driver for these spinlocks, but it has not yet been submitted upstream due to a lack of testing.

Getting this driver in mainline should be considered essential for any use of the AR100 as a realtime coprocessor. At the minimum, hardware spinlocks should protect GPIO Writes, and GPIO Direction. Other GPIO states, such as function, should be initialised by the kernel before the AR100 firmware is started. There should be little need to reinitialise GPIO modes inside AR100 code.

For simplicity of implementation of AR100 firmware, it is suggested that a set of known Hardware Spinlocks be defined in a kernel header file. AR100 firmware should then compile against this header, which would ensure synchronisation of the firmware with the kernel version. The predefined hardware spinlocks are initialised early by the kernel, during board initialisation using:

struct hwspinlock *hwspin_lock_request_specific(unsigned int id);

See [1] for details.
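
On the AR100 side, firmware would take the same hardware lock before touching a shared register. The sketch below illustrates the idea for a GPIO data register. It assumes, without confirmation from this page, that reading a lock register returns 0 when the lock was free (and is now held by the reader) and non-zero when it is already taken, and that writing 0 releases it. The function names are hypothetical, and the register addresses are left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Try to take a hardware spinlock. ASSUMED semantics: a read of 0 means
 * the lock was free and now belongs to us. */
bool hwlock_try(volatile uint32_t *lock_reg)
{
    return *lock_reg == 0;
}

/* Release a hardware spinlock. ASSUMED semantics: writing 0 frees it. */
void hwlock_release(volatile uint32_t *lock_reg)
{
    *lock_reg = 0;
}

/*
 * Update the bits selected by `mask` in a GPIO data register while
 * holding the lock. Uses only 32-bit accesses, as MMIO from the AR100
 * requires. Returns false without touching the register if the lock
 * is currently held elsewhere.
 */
bool gpio_write_locked(volatile uint32_t *lock_reg,
                       volatile uint32_t *gpio_data,
                       uint32_t mask, uint32_t value)
{
    if (!hwlock_try(lock_reg))
        return false;

    *gpio_data = (*gpio_data & ~mask) | (value & mask);
    hwlock_release(lock_reg);

    return true;
}
```

Returning instead of spinning leaves the retry policy to the caller, which may matter for firmware with realtime constraints. The kernel would need to take the same lock ID around its own GPIO writes for this to be effective.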


Documentation

Links

Notes